Exact Asymptotic Results for a Model of Sequence Alignment 
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Finding analytically the statistics of the longest common subsequence (LCS) of a pair of random 
sequences drawn from c alphabets is a challenging problem in computational evolutionary biology. 
We present exact asymptotic results for the distribution of the LCS in a simpler, yet nontrivial, 
variant of the original model called the Bernoulli matching (BM) model which reduces to the original 
model in the c — > oo limit. We show that in the BM model, for all c, the distribution of the 
asymptotic length of the LCS, suitably scaled, is identical to the Tracy- Widom distribution of the 
largest eigenvalue of a random matrix whose entries are drawn from a Gaussian unitary ensemble. 
In particular, in the c — > oo limit, this provides an exact expression for the asymptotic length 
distribution in the original LCS problem. 

PACS numbers: 87.10. +e, 87.15. Cc, 02.50.-r, 05.40.-a 
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Sequence alignment is one of the most useful quan- 
titative methods used in evolutionary molecular biol- 
ogy [1-3]. The goal of an alignment algorithm is to 
search for similarities in patterns in different sequences. 
A classic and much studied alignment problem is the 
so called 'longest common subsequence' (LCS) problem. 
The input to this problem is a pair of sequences a = 
{«!, a 2 , a,} (of length i) and (3 = {Pi,/3 2 , . . . , /3,} (of 
length j). For example, a and (3 can be two random se- 
quences of the 4 base pairs A, C, G, T of a DNA molecule, 
e.g., a = {A,C,G,C,T,A,C} and (3 = {C,T,G,A,C}. 
A subsequence of a is an ordered sublist of a (entries of 
which need not be consecutive in a), e.g, {C,G,T,C}, 
but not {T,G,C}. A common subsequence of two se- 
quences a and (3 is a subsequence of both of them. For 
example, the subsequence {C, G, A, C} is a common sub- 
sequence of both a and f3. There can be many possible 
common subsequences of a pair of sequences. The aim 
of the LCS problem is to find the longest of such com- 
mon subsequences. This problem and its variants have 
been widely studied in biology [4-7], computer science 
[8-10,2], probability theory [11-16] and more recently in 
statistical physics [17-19]. A particularly important ap- 
plication of the LCS problem is to quantify the closeness 
between two DNA sequences. In evolutionary biology, 
the genes responsible for building specific proteins evolve 
with time and by finding the LCS of the same gene in 
different species, one can learn what has been conserved 
in time. Also, when a new DNA molecule is sequenced in 
vitro, it is important to know whether it is really new or 
it already exists. This is achieved quantitatively by mea- 
suring the LCS of the new molecule with another existing 
already in the database. 

For a pair of fixed sequences of length i and j respec- 
tively, the length L^j of their LCS is just a number. 
However, in the stochastic version of the LCS problem 
one compares two random sequences drawn from c al- 
phabets and hence the length Lij is a random variable. 



A major challenge over the last three decades has been 
to determine the statistics of Lij [11-15]. For equally 
long sequences (i — j = n), it has been proved that 
(L n ,n) ~ 7c" for n 3> 1, where the averaging is performed 
over all realizations of the random sequences. The con- 
stant 7 C is known as the Chvatal-Sankoff constant which, 
to date, remains undetermined though there exists sev- 
eral bounds [12,14,15], a conjecture due to Steele [13] that 
7 C = 2/(1 + y/c) and a recent proof [16] that 7 C — > 2/^/c 
as c — > oo. Unfortunately, no exact results are available 
for the finite size corrections to the leading behavior of 
the average (L n , n ), for the variance, and also for the full 
probability distribution of L n>n . Thus, despite tremen- 
dous analytical and numerical efforts, exact solution of 
the random LCS problem has, so far, remained elusive. 
Therefore it is important to find other variants of this 
LCS problem that may be analytically tractable. 

Computationally, the easiest way to determine the 
length Lij of the LCS of two arbitrary sequences of 
lengths i and j (in polynomial time ~ 0(ij)) is via using 
the recursive algorithm [2,19] 



max [Li-ij, Lij-i, Li-\j-\ + r)i : 



subject to the initial conditions Li 
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0. 



The variable rjij is either 1 when the characters at the 
positions i (in the sequence a) and j (in the sequence 
0) match each other, or if they do not. Note that the 
variables r?i,j's are not independent of each other. To 
see this consider the simple example - matching of two 
strings a = AB and (3 — AA. One has by definition: 
771,1 = 771,2 = 1 and 772,1 = 0. The knowledge of these 
three variables is sufficient to predict that the last two 
letters will not match, i.e., 772,2 = 0. Thus, 772,2 can 
not take its value independently of ?7i 1, 771,2 7 ^2,1 ■ These 
residual correlations between the r\ij variables make the 
LCS problem rather complicated. Note however that for 
two random sequences drawn from c alphabets, these cor- 
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relations between the rjij variables vanish in the c — > oo 
limit. 

A simpler but natural variant of this LCS problem is 
the Bernoulli matching (BM) model where one ignores 
the correlations between Jfoj's for all c [19]. The BM 
model reduces to the original LCS problem only in the 
c — > oo limit. The length Lf^ 1 of the BM model satisfies 
the same recursion relation in Eq. (1) except that %,j's 
are now independent and each drawn from the bimodal 
distribution: p(rj) = (l/c)<5,,,i + (1 - l/c)5 vfi . The BM 
model, though simpler than the original LCS problem, is 
still nontrivial due to the nonlinear recursion relation in 
Eq. (1). Using the cavity method of spin glass physics 
[20] , the asymptotic behavior of the average length in the 
BM model was determined analytically [19], 
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where 7^ = 2/(1 + y/c), same as the conjectured value 
of the Chvatal-Sankoff constant 7 C for the original LCS 
model. However, other properties such as the variance 
or the distribution of remained untractable even in 
the BM model. 

The purpose of this Letter is to present an exact 
asymptotic formula for the distribution of the length 
L„ „ in the BM model for all c. Our main result is that 
for large n, 
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where x is a random variable with a n-independent distri- 
bution, Prob(x < x) = Fxwf 1 ) which is the well studied 
Tracy- Widom distribution for the largest eigenvalue of a 
random matrix with entries drawn from a Gaussian uni- 
tary ensemble [21]. For a detailed form of the function 
FtwW) see [21]- We show that for all c, 
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This allows us to calculate the average including the sub- 
leading finite size correction term and the variance of 
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for large n, 

(i^ / >-7f M -+(x>/(c)n 1 / 3 
Var^>(( X 2 )-( X ) 2 ) f\c)n 



2/3 



(5) 



where one can use the known exact values [21], (x) — 
-1.7711 . . . and (x 2 ) - (x) 2 = 0.8132 .... In particular, 
we note that in the limit c — > 00, Eqs. (3)- (5) provide 
exact asymptotic results for the original LCS model as 
well. 

In the BM model, the length Lfj can be interpreted 
as the height of a surface over the 2-d plane con- 

structed via the recursion relation in Eq. (1). A typical 
surface, shown in Fig. (la), has terrace-like structures. 
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FIG. 1. 



Examples of (a) BM surface L 



(b) ADP surface Lff p = h(x,y). 
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It is useful to consider the projection of the level lines 
separating the adjacent terraces whose heights differ by 
1 (see Fig. 2) onto the 2-d (i, j) plane. Note that, by the 
rule Eq. (1), these level lines never overlap each other, 
i.e., no two paths have any common edge. The statis- 
tical weight of such a projected 2-d configuration is the 
product of weights associated with the vertices of the 2- 
d plane. There are five types of possible vertices with 
nonzero weights as shown in Fig. 2, where p — 1/c and 
q = 1 — p. Since the level lines never cross each other, 
the weight of the first vertex in Fig. (2) is 0. 
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FIG. 2. Projected 2-d level lines separating adjacent ter- 
races of unit height difference in the BM surface in Fig. (la). 
The adjacent table shows the weights of all vertices on the 
2-d plane. 

Consider first the limit c — * 00 (i.e., p — * 0). The 
weights of all allowed vertices are 1, except the ones 
shown by black dots in Fig. (2), whose associated weights 
are p — * 0. The number N of these black dots inside a 
rectangle of area A = ij can be easily estimated. For 
large A and p — > 0, this number is Poisson distributed 
with the mean N = pA. The Bethe ansatz analysis shows 
that BM corresponds to the sector of the 5- vertex model 
[22] where the density a of empty edges in a row of ver- 
tical edges is close to the boundary a « 1~. The careful 
examination of the free energy near this boundary allows 
one to conclude that the leading contribution in p (for 
p — > 0) to N comes exactly from the line of phase tran- 
sitions in a 5-vertex model. The subleading corrections 
to N are of order ~ p 3 / 2 and are ensured by small de- 
viations from the critical line being beyond the Poisson 
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approximation [23]. 

The height LfJ 1 is just the number of level lines N 
inside this rectangle of area A = ij. The problem of esti- 
mating Af has recently appeared in a number of interface 
models such as a polynuclear growth model [24] and a 
ballistic deposition model [25]. By using a mapping to 
the longest increasing subsequence (LIS) of the equally 
likely permutations of a set of integers and then, by ap- 
plying a celebrated result due to Baik, Deift and Johans- 
son (BDJ) [26], it was shown [24,25] that the number of 
level lines Af inside the rectangle (for large A), appropri- 
ately scaled, has a limiting behavior, Af — ► + N 1 ^ \i 
where % is a random variable with Tracy- Widom distri- 
bution. Using N = pA = ij/c, one then obtains in the 
limit p — > 0, 
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In particular, for large equal length sequences i = j = n, 
we get for c — > oo 



(7) 



Note that since the BM and the original LCS model are 
equivalent in the limit c — > oo, the exact results in Eqs. 
(6)-(7) also hold for the LCS model. Note that only the 
leading behavior of the average (£ n ,n) was known before 
[16] in the c — > oo limit of the original LCS model. 

For finite c, while the above mapping to the LIS prob- 
lem still works, the corresponding permutations of the 
LIS problem are not generated with equal probability 
and hence one can no longer use the BDJ results. To 
make progress for finite c, we map the BM model exactly 
to a 3-d anisotropic directed percolation (ADP) model 
first considered by Rajesh and Dhar [27]. This ADP 
model can further be mapped to a (1 + l)-d directed 
polymer problem studied by Johansson [28] . For this spe- 
cific directed polymer problem, Johansson derived exact 
asymptotic result for the distribution of the polymer en- 
ergy. Translating these results back to the BM model, 
we derive our main results in Eqs. (3)-(5). Note that 
the recursion relation in Eq. (1) can also be viewed as 
a (1 + l)-d directed polymer problem [18,19] and some 
asymptotic results (such as the 0(n 2 / 3 ) behavior of the 
variance of L n , n for large n) can be obtained using the 
arguments of universality [18]. However, this does not 
provide precise results for the full distribution which are 
obtained here. 

Let us consider a directed bond percolation on a simple 
cubic lattice. The bonds are occupied with probabilities 
p x , p y , and p z along the x, y and z axes and are all di- 
rected towards increasing coordinates. Imagine a source 
of fluid at the origin which spreads along the occupied 
directed bonds. The sites that get wet by the fluid form 



a 3-d cluster. In the ADP problem, the bond occupa- 
tion probabilities are anisotropic, p x = p y = 1 (all bonds 
aligned along the x and y axes are occupied) and p z = p. 
Hence, if the point (x,y,z) gets wet by the fluid then 
all the points (x',y',z) on the same plane with x' > x 
and y' > y also get wet. Such a wet cluster is compact 
and can be characterized by its bounding surface height 
h(x, y) as shown in Fig. (lb). It is not difficult to see that 
the height h(x,y) satisfies the following recursion relation 
[27], 

h(x,y) = max [h(x - l,y),h(x,y- 1)] (8) 

where £ij's are i.i.d. random variables taking nonncga- 
tivc integer values with Prob(£, ; j = k) = (1 —p)p k for 
k = 0, 1, 2, . . .. One can also interpret the height h(x, y) 
in Eq. (8) as the energy of a directed polymer in the 
(x — y) plane. Precisely this particular version of the 
polymer problem was studied by Johansson [28] who ob- 
tained the asymptotic distribution of the height for large 
x and y, 

ur , 2^/pxy + p(x + y) 
h(x, y) -» + 



+ 



(pxy) 



1/6 



( 1 +p) + A/^(^ + y) 



2/3 



X, (9) 



where g = 1 — p, x is a random variable with a Tracy- 
Widom distribution. 

While the terrace-like structures of the ADP surface 
look similar to the BM surfaces (compare Figs. (la) and 
(lb)), there is an important difference between the two. 
In the ADP model, the level lines separating two adja- 
cent terraces can overlap with each other [27] , which does 
not happen in the BM model. However, by making the 
following change of coordinates in the ADP model [27] 



( = x + h(x,y); r] = y + h(x,y) 



(10) 



one gets a configuration of the surface where the level 
lines no longer overlap. Moreover, it is not difficult to 
show that the projected 2-d configuration of level lines 
of this shifted ADP surface has exactly the same statis- 
tical weight as the projected 2-d configuration of the BM 
surface. Denoting the BM height by h(x,y) — L^^f , one 
then has the identity, h((,rj) — h(x,y), which holds for 
each configuration. Using Eq. (10), one can rewrite this 
identity as 

MC,??) = /i(c-MC^),r?-MC,??)). (11) 

Thus, for any given height function h(x, y) of the ADP 
model, one can, in principle, obtain the corresponding 
height function h(x, y) for all (x, y) of the BM model by 
solving the nonlinear equation (11). This is however very 
difficult in practice. Fortunately, one can make progress 
for large (x, y) where one can replace the integer val- 
ued discrete heights by continuous functions h(x,y) and 



3 



h(x, y). Using the notation d x = d/dx it is easy to derive 
from Eq. (10) the following pair of identities, 



d x h = 



d,,h 



d v h 



1 — h — d v h ' 1 — dch, — d n h 

In a similar way, one can show that 



d c h 



d x h ~ 

l + d X h + dyh' ^ 



dyh 



1 + d x h + d v h' 



(12) 



(13) 



We then observe that Eqs. (12) and (13) are invariant 
under the simultaneous transformations 



( -> -x; T] > y; h^h. 



(14) 



Since the height is built up by integrating the derivatives, 
this leads to a simple result for large £ and rj, 



h(c,v) = h(-C,- v ). 



(15) 



Thus, if we know exactly the functional form of the 
ADP surface h(x, y), then the functional form of the BM 
surface h(x, y) for large x and y is simply obtained by 
h(x,y) = h(—x, —y). Changing x — > — x and y —* — y in 
Johansson's expression for the ADP surface in Eq. (9) 
we thus arrive at our main asymptotic result for the BM 
model 
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(pxy) 1 ^ 



2yfpxy - p(x + y) 
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-i 2/3 



X, (16) 



where p = 1/c and g = 1 — 1/c. For equal length se- 
quences x = y = n, Eq. (16) then reduces to Eq. (3). 

To check the consistency of our asymptotic results, we 
further computed the difference between the left- and the 
right-hand sides of Eq. (11), 

Ah(C, rf) = h((, v)-h{(- h(C, r?), v - MO vj) , (17) 

with the functions h(x, y) and h(x, y) given respectively 
by Eqs. (9) and (16). For large ( = i] one gets 



AMCC) - k /3 x 2 /3(i-vp) 4/3 



C 



-1/3 



(18) 



Thus the discrepancy falls off as a power law for large 
indicating that indeed our solution is asymptotically 
exact. We have also performed numerical simulations of 
the BM model using the recursion relation in Eq. (1) for 
c = 2, 4, 9, 16, 100. Our preliminary results [23] for rela- 
tively small system sizes (up to n = 5000) are consistent 
with our exact results in Eqs. (3)-(5). 

The Tracy- Widom distribution of the random matrix 
theory has appeared recently in a number of problems 
[21,29,28,24,25]. In this Letter, we have shown that it 
also describes the asymptotic distribution of the length of 
the longest common subsequence in a sequence matching 



problem. While a possible link between the two problems 
was speculated before [29], a precise connection, so far, 
was missing and is provided here. 
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